Author: Manuele Nolli, student BSc Computer Science SUPSI
Date: 28.11.2022
Mail: manuele.nolli@student.supsi.ch
This document is an analysis of a public dataset found on Kaggle.com.
The dataset contains about 80k wine reviews with variety, location, winery, price, points, taster name and description.
My analysis will focus on the following questions:
import numpy as np
import plotly.express as px
import plotly.graph_objs as go
import pandas as pd
from plotly.subplots import make_subplots
df=pd.read_csv("data/winemag2017-2020/winemag2017-2020.csv")
With the following code we can inspect how the dataset is structured: its columns, their types and the number of missing values.
print(f"---Dataset Info---")
#printing column names
print(f"Total columns: {len(df.columns)}")
print("Columns names:", end=" ")
for col in df:
    if col == 'winery':
        print(col, end=".")
    else:
        print(col, end=", ")
print()
#columns types
print(f"Columns type:")
#creating temp array
columnData = []
dfIndexType = []
for col in df.columns:
    temp = []
    dfIndexType.append(col)
    temp.append(df[col].apply(type).unique())
    temp.append(df[col].isnull().sum())
    columnData.append(temp)
#create new Dataframe
dfColumnsType = pd.DataFrame(columnData, columns=['Types','NaN Count'])
dfColumnsType.index = dfIndexType
#print columns type
display(dfColumnsType)
#df size
print(f"Dataframe rows: {len(df)}")
#df sample
print("Dataset samples:")
df.sample(5)
---Dataset Info---
Total columns: 15
Columns names: country, description, designation, points, price, province, region_1, region_2, taster_name, taster_photo, taster_twitter_handle, title, variety, vintage, winery.
Columns type:
| | Types | NaN Count |
|---|---|---|
| country | [<class 'str'>, <class 'float'>] | 5 |
| description | [<class 'str'>] | 0 |
| designation | [<class 'str'>, <class 'float'>] | 21319 |
| points | [<class 'int'>] | 0 |
| price | [<class 'float'>] | 4647 |
| province | [<class 'str'>, <class 'float'>] | 5 |
| region_1 | [<class 'float'>, <class 'str'>] | 12913 |
| region_2 | [<class 'float'>, <class 'str'>] | 49894 |
| taster_name | [<class 'str'>, <class 'float'>] | 150 |
| taster_photo | [<class 'str'>, <class 'float'>] | 150 |
| taster_twitter_handle | [<class 'str'>, <class 'float'>] | 1076 |
| title | [<class 'str'>] | 0 |
| variety | [<class 'str'>] | 0 |
| vintage | [<class 'str'>] | 0 |
| winery | [<class 'str'>] | 0 |
Dataframe rows: 81115
Dataset samples:
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_photo | taster_twitter_handle | title | variety | vintage | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78071 | US | This Cabernet Franc rosé shows a pleasing inte... | Dry | 90 | 17.0 | New York | Finger Lakes | Finger Lakes | Alexander Peartree | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @apatrone23 | Lamoreaux Landing 2019 Dry Rosé (Finger Lakes) | Rosé | 2019 | Lamoreaux Landing |
| 75848 | US | Tannic and oaky, this is rough-hewn with flavo... | Estate | 85 | 30.0 | Oregon | Applegate Valley | Southern Oregon | Paul Gregutt | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @paulgwine | Schmidt 2016 Estate Tempranillo (Applegate Val... | Tempranillo | 2016 | Schmidt |
| 8932 | US | This was barrel fermented, and picked up a fai... | NaN | 87 | 16.0 | Oregon | Elkton Oregon | Southern Oregon | Paul Gregutt | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @paulgwine | River's Edge 2015 Pinot Gris (Elkton Oregon) | Pinot Gris | 2015 | River's Edge |
| 52723 | France | Produced from old vines in a single vineyard, ... | Croix de Montceau | 90 | 35.0 | Burgundy | Saint-Véran | NaN | Roger Voss | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @vossroger | Vignerons des Terres Secrètes 2016 Croix de Mo... | Chardonnay | 2016 | Vignerons des Terres Secrètes |
| 30940 | Italy | Camphor, cedar, coconut and dried aromatic her... | Badarina | 92 | 90.0 | Piedmont | Barolo | NaN | Kerin O’Keefe | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @kerinokeefe | Grimaldi Bruna 2015 Badarina (Barolo) | Nebbiolo | 2015 | Grimaldi Bruna |
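
The same overview of column types and missing values can also be obtained with built-in pandas helpers. The sketch below is only an illustration: it uses a small made-up frame (`sample`) instead of the real dataset, and the helper name `summarize_columns` is hypothetical.

```python
import numpy as np
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and the number of missing values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "NaN Count": frame.isna().sum(),
    })

# Made-up rows, only to illustrate the shape of the result
sample = pd.DataFrame({
    "country": ["US", np.nan, "Italy"],
    "points": [90, 85, 92],
    "price": [17.0, np.nan, 90.0],
})
summary = summarize_columns(sample)
print(summary)
```

Unlike the per-value `type` inspection above, `dtypes` reports one storage type per column, so a text column with missing values shows up as a single dtype plus a non-zero NaN count.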
It is possible to see that the dataset contains 81,115 rows and 15 columns. The columns are:
This section shows how the wines are distributed across the continents. Since the dataset only provides a country column, I decided to create a new column called continent that contains the continent of each country.
#Continent list
europe = ['Austria', 'Bosnia and Herzegovina','Bulgaria','Croatia','Cyprus','Czech Republic','England', 'France','Germany','Greece','Italy','Luxembourg','Portugal','Hungary', 'Macedonia', 'Moldova', 'Romania', 'Serbia', 'Slovakia', 'Slovenia', 'Spain', 'Switzerland', 'Turkey', 'Ukraine']
asia = ['Armenia', 'China','India','Israel','Lebanon' ]
northAmerica = ['Canada','US','Mexico']
sudAmerica = ['Argentina','Brazil','Chile','Peru','Uruguay']
oceania = ['Australia','New Zealand']
africa = ['South Africa','Morocco']
other = ['Egypt', 'Georgia']
#Continents with a small amount of reviews are grouped under 'Other'
def continentDispacher(row):
    if row['country'] in europe:
        val = 'Europe'
    elif row['country'] in asia:
        #val = 'Asia'
        val = 'Other'
    elif row['country'] in northAmerica:
        val = 'North America'
    elif row['country'] in sudAmerica:
        #val = 'Sud America'
        val = 'Other'
    elif row['country'] in oceania:
        #val = 'Oceania'
        val = 'Other'
    elif row['country'] in africa:
        #val = 'Africa'
        val = 'Other'
    else:
        val = 'Other'
    return val
df['continent'] = df.apply(continentDispacher,1)
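A row-wise `apply` works, but the same lookup can be written as a dictionary passed to `Series.map`, which avoids calling a Python function for every row. The following is a minimal sketch under stated assumptions: the dictionary `COUNTRY_TO_CONTINENT` is a hypothetical, shortened lookup table and `add_continent` is an illustrative helper, not part of the analysis above.

```python
import pandas as pd

# Hypothetical, shortened lookup table: country -> continent label
COUNTRY_TO_CONTINENT = {
    "France": "Europe",
    "Italy": "Europe",
    "US": "North America",
    "Canada": "North America",
}

def add_continent(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 'continent' column; unmapped or missing countries become 'Other'."""
    out = frame.copy()
    out["continent"] = out["country"].map(COUNTRY_TO_CONTINENT).fillna("Other")
    return out

demo = add_continent(pd.DataFrame({"country": ["France", "US", "Chile", None]}))
print(demo["continent"].tolist())
```

Countries that are missing from the dictionary, as well as missing values, end up in 'Other', which mirrors the final `else` branch of the function above.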
The following code shows the distribution of the wines across the continents through a pie chart. It is possible to see that the majority of the wines are produced in Europe, followed by North America.
#Distribution of the wines by continent
pieContinent = px.pie(df, names='continent', title='Distribution of wines across continents')
pieContinent.update_traces(textposition='inside', textinfo='percent+label')
pieContinent.update(layout_showlegend=False)
#update layout for export
"""
pieContinent.update_layout(
title={
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
size=18),
height=1000,
width=1000)
"""
pieContinent.show()
#group by country to get the number of reviews per country
dfCountry = df.groupby('country').count().reset_index()
dfCountry = dfCountry[['country','continent']]
dfCountry.columns = ['country','count']
#display dfCountry on a map
fig = px.choropleth(dfCountry, locations="country", locationmode='country names', color="count", hover_name="country", color_continuous_scale=px.colors.sequential.Plasma)
#more realistic map
fig.update_geos(projection_type="natural earth")
#update layout to enlarge the map
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0},title = 'Wine distribution across countries')
#update layout for export
"""
fig.update_layout(
title={
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
size=18),
height=1000,
width=2000)
"""
fig.show()
#group by continent, country, region_1 and region_2 to get the number of reviews
dfRegion = df.groupby(['continent','country','region_1','region_2'], dropna=False).count().reset_index()
dfRegion = dfRegion[['continent','country','region_1','region_2','points']]
dfRegion.columns = ['continent','country','region_1','region_2','count']
#I could not find a better way to show the rows where region_1 or region_2 is null
dfRegion.fillna('None', inplace=True)
fig = px.treemap(dfRegion, path=["continent", 'country', 'region_1', 'region_2'],branchvalues="total", values='count', title='Wine distribution across countries')
fig.show()
#create a sunburst chart
fig = px.sunburst(dfRegion, path=["continent", 'country', 'region_1', 'region_2'], values='count', title='Wine distribution across countries')
fig.show()
The above chart is an alternative way to see the distribution of the wines across the continents. It is more interactive and it is possible to see the exact number of wines produced in each continent, country and region.
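The counts behind the treemap and the sunburst rely on `dropna=False`: by default `groupby` drops every row whose grouping key is missing, so wines without a region would disappear from the hierarchy. A minimal sketch with made-up rows (the frame `wines` is an assumption for illustration, and `dropna` requires pandas 1.1 or later):

```python
import numpy as np
import pandas as pd

# Made-up rows: the last wine has no region_1
wines = pd.DataFrame({
    "country": ["US", "US", "France", "Italy"],
    "region_1": ["Finger Lakes", "Finger Lakes", "Saint-Véran", np.nan],
})

# Default behaviour: the row with a missing key is silently dropped
default_counts = wines.groupby(["country", "region_1"]).size()
# dropna=False keeps it as its own group
all_counts = wines.groupby(["country", "region_1"], dropna=False).size()
# A placeholder label gives the missing level a name in hierarchy charts
labelled = all_counts.reset_index(name="count").fillna("None")

print(default_counts.sum(), all_counts.sum())
print(labelled)
```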
Another interesting aspect of the dataset is the distribution of the points. The points are given by the tasters on a scale from 80 to 100. WineEnthusiast also groups the scores into six categories:
In the following section a new column called pointsDescription is created that contains the description of the score.
#Create new column with points description
def pointsDispacher(points):
    if points < 83:
        val = 'Acceptable'
    elif points < 87:
        val = 'Good'
    elif points < 90:
        val = 'Very good'
    elif points < 93:
        val = 'Excellent'
    elif points < 97:
        val = 'Superb'
    else:
        val = 'Classic'
    return val
#Create new column with points description
df['pointsDescription'] = df['points'].map(pointsDispacher)
#Histogram of points description
pointDistribution = px.histogram(df, x='points', color='pointsDescription', title='Points distribution', height=500,
category_orders=dict(pointsDescription=['Classic', 'Superb', 'Excellent', 'Very good', 'Good','Acceptable']),
labels={
"pointsDescription": "Point Description"
},
color_discrete_map = {'Classic':'#903f5c','Superb':'#006179','Excellent':'#008377','Very good':'#09a259', 'Good':'#90b827', 'Acceptable':'#ffbf00'}
)
#Update axis
pointDistribution.update_xaxes(title='Point',tickmode='linear')
pointDistribution.update_yaxes(title='Count')
#update layout for export
"""
pointDistribution.update_layout(
title={
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
font=dict(
size=18),
height=700,
width=2000)
"""
pointDistribution.show()
From this graph it is possible to see that the most common category is Good, followed by Very good (the middle scores are the most common).
It is curious to see that there are more wines with 90 points than with 89 points. That is probably because the tasters are more likely to give a wine 90 points than 89 points to have the wine labeled as Excellent.
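The binning performed by `pointsDispacher` can also be expressed with `pd.cut`. The sketch below is an illustration, not the method used above: the edges are copied from the thresholds of that function, the intervals are closed on the left, and the `scores` series is made up. One difference to keep in mind is that `pd.cut` returns NaN for values outside the edges, whereas the function assigns every score below 83 to 'Acceptable'.

```python
import pandas as pd

# Bin edges copied from the thresholds of pointsDispacher; intervals are [left, right)
EDGES = [80, 83, 87, 90, 93, 97, 101]
LABELS = ["Acceptable", "Good", "Very good", "Excellent", "Superb", "Classic"]

# Made-up scores that sit on or next to each threshold
scores = pd.Series([80, 82, 83, 89, 90, 96, 97, 100])
labels = pd.cut(scores, bins=EDGES, labels=LABELS, right=False)
print(labels.astype(str).tolist())
```

Because the result is an ordered categorical, the category order is carried by the data itself instead of being repeated in the plotting call.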
In this section it is possible to see the distribution of the vintage of the wines. The vintage is the year in which the grapes were harvested.
import datetime
dfVintageWithoutNaN = df.copy()
#Remove the 'NV' string (non-vintage): wines blended from grapes harvested in different years
dfVintageWithoutNaN = dfVintageWithoutNaN[dfVintageWithoutNaN['vintage'] != 'NV']
dfVintageWithoutNaN['vintage'] = pd.DatetimeIndex(dfVintageWithoutNaN['vintage']).year
#Removing impossible data
dfVintageWithoutNaN = dfVintageWithoutNaN[dfVintageWithoutNaN['vintage'] < datetime.datetime.now().year] #year in the future
#Removing wines that only have a year in the title (assumption: a genuinely old wine costs at least 100)
#Note: & binds tighter than |, so the two price conditions are grouped explicitly
dfVintageWithoutNaN = dfVintageWithoutNaN.drop(dfVintageWithoutNaN[(dfVintageWithoutNaN['vintage'] < 1980) & ((dfVintageWithoutNaN['price'] < 100) | (dfVintageWithoutNaN['price'].isna()))].index)
#Histogram of vintage distribution
vintageDistribution = px.histogram(dfVintageWithoutNaN, x="vintage", title='Vintage review distribution')
#Update axis
vintageDistribution.update_xaxes(title='Year',dtick=1)
vintageDistribution.update_yaxes(title='Count')
vintageDistribution.show()
It must be remembered that the dataset contains wines reviewed between 2017 and 2020, so it is normal that the majority of the wines are from the preceding few years. However, there are also some very old wines in the dataset. The oldest wine is from 1931 and, surprisingly, it does not have a very high score.
dfVintageWithoutNaN.loc[dfVintageWithoutNaN['vintage'] == 1931]
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_photo | taster_twitter_handle | title | variety | vintage | winery | continent | pointsDescription |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2722 | Portugal | This remarkable wine looks old, and with its d... | Tinto | 89 | 550.0 | Colares | NaN | NaN | Roger Voss | https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... | @vossroger | Adega Viuva Gomes 1931 Tinto Red (Colares) | Ramisco | 1931 | Adega Viuva Gomes | Europe | Very good |
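
The row filter in the vintage code combines three conditions with `&` and `|`. In Python `&` binds more tightly than `|`, so the grouping decides whether wines without a price are removed regardless of their vintage. A minimal sketch with made-up rows (the frame `wines` and the mask names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up rows: a recent wine without a price, an old cheap one, an old expensive one
wines = pd.DataFrame({
    "vintage": [2016, 1975, 1931],
    "price": [np.nan, 20.0, 550.0],
})

old = wines["vintage"] < 1980
cheap = wines["price"] < 100
no_price = wines["price"].isna()

# Without extra parentheses this is evaluated as (old & cheap) | no_price
implicit = old & cheap | no_price
# Explicit grouping: only old wines that are cheap or have no price
explicit = old & (cheap | no_price)

print(implicit.tolist())
print(explicit.tolist())
```

With the implicit grouping the recent wine without a price is selected as well; with the explicit grouping only the old, cheap wine is.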
In this section it is possible to see the distribution of the variety of the wines. The variety is the type of grapes used to make the wine (e.g. Pinot Noir). The dataset contains many different varieties, but I decided to show only the top 10. This setting can be changed through the wineCountToShow variable.
First, I created different versions of the dataset that will be used to create the graphs.
#Wine to be shown
wineCountToShow = 10
# Review count per variety, sorted from the most to the least reviewed
dfVarietyCount = df.groupby('variety').size().sort_values(ascending=False).reset_index(name='count')
# Top {wineCountToShow} wine variety with the highest count
dfMostWineVariety = dfVarietyCount.head(wineCountToShow)
# Other wine variety
dfOtherWineVariety = dfVarietyCount.iloc[wineCountToShow:]
#Create order of bars
order = dfMostWineVariety['variety'].tolist()
order.reverse()
order = ['Other'] + order
# Top {wineCountToShow} wine variety with the highest count and the price
dfFiltered = df.copy()
dfFiltered = dfFiltered.loc[df['variety'].isin(dfMostWineVariety['variety'])]
dfFilteredPoints = dfFiltered.groupby(['variety']).agg({'points': ['mean']}).reset_index()
# Other wine variety
dfFilteredOtherWine = df.loc[df['variety'].isin(dfOtherWineVariety['variety'])]
dfFilteredOtherWinePoints = dfFilteredOtherWine.groupby(['variety']).agg({'points': ['mean']}).reset_index()
Now it is finally time to create the graphs. The left graph is a bar chart that shows the number of reviews per variety, the center graph is another bar chart that shows the average points of each variety, and the right graph is a box plot that shows the distribution of the prices.
#review count per points value; reset_index(name=...) names the count column directly
groupbypoints = df.groupby(['pointsDescription','points']).size().reset_index(name='count')
topReviewedWines = make_subplots(rows=1, cols=3,subplot_titles=('Reviews count',"Variety average points","Price distribution"), shared_yaxes=True,horizontal_spacing = 0.025
)
#Variety average points
trace1 = go.Bar(y=dfFilteredPoints['variety'], x=dfFilteredPoints['points']['mean'],orientation='h',marker_color='rgba(101, 109, 255, 1)')
trace2 = go.Bar(x=[dfFilteredOtherWinePoints['points']['mean'].mean()], y=['Other'], name='Other', orientation='h',marker_color='rgba(55, 83, 109, 0.6)')
#Wine reviews based on variety
trace3 = go.Bar(y=dfMostWineVariety['variety'], x=dfMostWineVariety['count'], name='Top variety', orientation='h',marker_color='rgba(101, 109, 255, 1)')
trace4 = go.Bar(x=[dfOtherWineVariety['count'].sum()], y=['Other'], name='Other', orientation='h',marker_color='rgba(55, 83, 109, 0.6)')
#Price distribution
trace5 = go.Box(x=dfFiltered['price'], y=dfFiltered['variety'], orientation='h',marker_color='rgba(101, 109, 255, 1)')
trace6 = go.Box(x=dfFilteredOtherWine['price'], name='Other', orientation='h',marker_color='rgba(55, 83, 109, 0.6)')
#Add traces
topReviewedWines.add_trace(trace1, row=1, col=2)
topReviewedWines.add_trace(trace2, row=1, col=2)
topReviewedWines.add_trace(trace3, row=1, col=1)
topReviewedWines.add_trace(trace4, row=1, col=1)
topReviewedWines.add_trace(trace5, row=1, col=3)
topReviewedWines.add_trace(trace6, row=1, col=3)
#General layout
topReviewedWines.update_yaxes(categoryorder='array',categoryarray=order)
topReviewedWines.update_layout(showlegend=False)
topReviewedWines.update_layout(title=f'[top {wineCountToShow}] Reviewed wines')
#update title yaxis
topReviewedWines.update_yaxes(title_text='Wine variety', row=1, col=1)
#left graph layout
topReviewedWines.update_xaxes(title_text="Count", col=1)
topReviewedWines.update_xaxes(dtick=5000, col=1)
#center graph layout
topReviewedWines.update_xaxes(title_text="Point", col=2)
topReviewedWines.update_xaxes(range=[80, 100], col=2)
topReviewedWines.update_xaxes(dtick=2, col=2)
#right graph layout
topReviewedWines.update_xaxes(title_text="Price USD", col=3)
topReviewedWines.update_xaxes(type="log", range=[0,4], col=3)
#update layout for export
"""
topReviewedWines.update_layout(
font=dict(
size=25),
height=1000,
width=3000)
topReviewedWines.update_layout(title_font_size=1)
topReviewedWines.update_annotations(font_size=50)
"""
#Finally show the graph
topReviewedWines.show()
It is interesting to see that the other varieties together have a lot more reviews than the top 10 varieties. This means that the dataframe is well balanced and is not dominated by a few varieties.
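The comparison between the top varieties and all the remaining ones comes down to a count per variety followed by a split into the first n entries and the rest. A minimal sketch of that idea, assuming a made-up `varieties` series and a hypothetical helper named `top_n_with_other`:

```python
import pandas as pd

def top_n_with_other(values: pd.Series, n: int) -> pd.Series:
    """Count values, keep the n most frequent and sum the rest under 'Other'."""
    counts = values.value_counts()
    top = counts.head(n)
    other_total = counts.iloc[n:].sum()
    return pd.concat([top, pd.Series({"Other": other_total})])

# Made-up variety column
varieties = pd.Series(
    ["Pinot Noir"] * 4 + ["Chardonnay"] * 3 + ["Rosé", "Nebbiolo", "Ramisco"]
)
result = top_n_with_other(varieties, n=2)
print(result)
```

`value_counts` already sorts from the most to the least frequent value, so no separate sorting step is needed before taking the head.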
There are two main graphs in this section: the first one is a box plot representing the distribution of the prices by points and the second one is a percentage histogram of the prices grouped by a personal price description:
#Price thresholds (upper bound of each range, in USD)
lowOffset = 10
mediumOffset = 40
expensiveOffset = 100
#Function to create a new column with the price range
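def priceDispacher(price):
    #missing prices get their own label, otherwise they would fall through to 'Luxury'
    if pd.isna(price):
        val = 'Unknown'
    elif price <= lowOffset:
        val = 'Low'
    elif price <= mediumOffset:
        val = 'Medium'
    elif price <= expensiveOffset:
        val = 'Expensive'
    else:
        val = 'Luxury'
    return val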
#Apply priceDispacher function to price column
df['priceDescription'] = df['price'].map(priceDispacher)
boxPricePoint = go.Figure()
boxPricePoint.add_trace(go.Box(x=df['points'], y=df['price'], orientation='v',marker_color='rgba(101, 109, 255, 1)', boxmean=True))
boxPricePoint.update_layout(xaxis_range=[79.5, 100.5], title='Price vs Points')
boxPricePoint.update_xaxes(title='Point', dtick=1)
boxPricePoint.update_yaxes(title='Price USD',type="log")
boxPricePoint.update_yaxes()
#update layout for export
"""
boxPricePoint.update_layout(
font=dict(
size=25),
height=800,
width=3000)
"""
boxPricePoint.show()
The box plot shows that, as could be expected, the wines with the highest points are the most expensive, so there is a strong connection between the price and the points. This is also confirmed by the following histogram, in which the wines with the highest points fall in the most expensive price ranges.
averagepricePoint = px.histogram(df,x='points', color='priceDescription', barmode='stack', barnorm='percent',
category_orders=dict(priceDescription=['Low', 'Medium', 'Expensive', 'Luxury']), title='Price distribution by points', labels={
"priceDescription": "Price Description"
}, color_discrete_sequence=px.colors.sequential.Teal
)
averagepricePoint.update_xaxes(title='Point', dtick=1)
averagepricePoint.update_yaxes(title='Count %')
#update layout for export
"""
averagepricePoint.update_layout(
font=dict(
size=25),
height=800,
width=3000)
"""
averagepricePoint.show()
It is curious to see that there are some wines with a very high price and very low points and, on the other side, some wines with a very low price and very high points. This means that the price is not the only factor that influences the points.
Note: I tried to create a graph object with the previous two graphs connected by the x-axis but it is currently not possible to do that with plotly. Further information: https://community.plotly.com/t/how-to-set-barmode-for-individual-subplots/47931
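The strength of the connection between price and points can also be summarized with a single number. The sketch below computes a Spearman rank correlation as the Pearson correlation of the ranks; it is an illustration on made-up values (the frame `reviews` is an assumption), not a result obtained from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up prices and scores, only to show the calculation
reviews = pd.DataFrame({
    "price": [10.0, 15.0, 30.0, 60.0, 120.0, np.nan],
    "points": [84, 88, 87, 91, 95, 90],
})

# Rows without a price cannot contribute to the correlation
priced = reviews.dropna(subset=["price"])

# Spearman correlation = Pearson correlation of the ranks,
# so the very skewed price scale does not dominate the result
rho = priced["price"].rank().corr(priced["points"].rank())
print(round(rho, 2))
```

A value close to 1 would indicate that more expensive wines almost always score higher, while a lower value leaves room for the cheap, highly rated wines and the expensive, poorly rated ones mentioned above.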
Now it is time to see the distribution of the reviewers. I am interested in seeing how many reviewers there are and how many reviews each of them has done. I also want to see if there are some reviewers that are more reliable than others and if there are some reviewers that are more likely to review wines from a specific continent.
from itertools import product
tasterDistribution = make_subplots(rows=1, cols=3,subplot_titles=('Count',"Points distribution","Continent distribution"), shared_yaxes=True,horizontal_spacing = 0.01)
#Taster review count
trace1 = go.Histogram(y=df['taster_name'], name='Taster review count', marker_color='rgba(101, 109, 255, 1)')
#Point awarded
trace2 = go.Box(x=df['points'], y=df['taster_name'], name='Point awarded', orientation='h',marker_color='rgba(101, 109, 255, 1)' )
#Continent preference by taster
#count the reviews per continent and taster, then the total reviews per taster
dfContinentTaster = df.groupby(['continent','taster_name']).size().reset_index(name='reviewPerContinent')
totReviewPerTaster = df.groupby(['taster_name'])['continent'].count().reset_index(name='totalReview')
##Merge the two dataframe into one##
#create a list of all the possible combination of taster and continent
combs = pd.DataFrame(list(product(df['continent'].unique(), df['taster_name'].unique())),
columns=['continent', 'taster_name'])
#merge dfContinentTaster and combs for all the possible combination (goal: fill the missing value with 0)
dfContinentTaster = dfContinentTaster.merge(combs, how = 'right').fillna(0)
#finally merge with the total review per taster
dfContinentTaster = dfContinentTaster.merge(totReviewPerTaster, on='taster_name')
trace3 = go.Heatmap( x=dfContinentTaster['continent'], y=dfContinentTaster['taster_name'],z=(dfContinentTaster['reviewPerContinent']/dfContinentTaster['totalReview'])*100, name='Continent preference by taster', colorscale='Blues', colorbar=dict(title='Count %'))
#create order by review count
order = df['taster_name'].value_counts().index
#update layout
tasterDistribution.update_yaxes(categoryorder='array',categoryarray=order)
tasterDistribution.update_layout(showlegend=False, title='Taster review')
#layout for the first graph
tasterDistribution.update_xaxes(title='Count', row=1, col=1)
tasterDistribution.update_yaxes(title='Taster name', row=1, col=1)
#layout for the second graph
tasterDistribution.update_xaxes(title='Point awarded', dtick=2, range=[79.5, 100.5],row=1, col=2)
#layout for the third graph
tasterDistribution.update_xaxes(title='Continent', row=1, col=3)
#set background color
#add traces to the graph
tasterDistribution.add_trace(trace1, row=1, col=1)
tasterDistribution.add_trace(trace2, row=1, col=2)
tasterDistribution.add_trace(trace3, row=1, col=3)
#update layout for export
"""
tasterDistribution.update_layout(
font=dict(
size=25),
height=1000,
width=3000)
tasterDistribution.update_layout(title_font_size=1)
tasterDistribution.update_annotations(font_size=50)
"""
tasterDistribution.show()
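The percentage of reviews per continent for each taster, which feeds the heatmap, can also be computed in one step with `pd.crosstab` and `normalize="index"`. This is a minimal sketch on made-up reviews with hypothetical taster names, not a replacement for the code above:

```python
import pandas as pd

# Made-up reviews: two hypothetical tasters, three continent labels
reviews = pd.DataFrame({
    "taster_name": ["Taster A", "Taster A", "Taster A", "Taster B", "Taster B"],
    "continent": ["Europe", "Europe", "Other", "North America", "North America"],
})

# Rows are tasters, columns are continents; each row sums to 100
share = pd.crosstab(reviews["taster_name"], reviews["continent"], normalize="index") * 100
print(share.round(1))
```

Combinations that never occur are filled with 0 automatically, which is what the merge with all taster and continent combinations achieves in the code above.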
There are different considerations to make:
In this section I decided to represent the most used words in the description of the wines for each points category. I used the description column to extract the words after a cleaning process.
#Most used words in wine description for each point
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import string
import re
import matplotlib as mpl
import matplotlib.pyplot as plt
import nltk.corpus
#nltk.download('stopwords')
from nltk.corpus import stopwords
#Stopwords are loaded once: calling stopwords.words() for every single word is very slow
stopWords = set(stopwords.words('english'))
#Trivial words that the case-sensitive stopword check does not catch
trivialWords = {'The', 'This', 'Wine', 'wine'}
#Function to clean the description
def cleanDescription(description):
    #remove punctuation (this already covers underscores, dashes, slashes, dots, commas, colons, ...)
    description = description.translate(str.maketrans('', '', string.punctuation))
    #remove numbers
    description = re.sub(r'\d+', '', description)
    #split into words, then remove stopwords, trivial words, short words and words with digits
    description = [word for word in description.split()
                   if word not in stopWords
                   and word not in trivialWords
                   and len(word) > 2
                   and not any(c.isdigit() for c in word)]
    return description
#Function to create the wordcloud
def createWordCloud(description, title):
    #create wordcloud
    wordcloud = WordCloud(width = 500, height = 500,
                          min_font_size = 10,
                          background_color ='white').generate(description)
    # plot the WordCloud image
    #plt.figure(figsize = (25,25), facecolor = None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad = 0)
    #plt.title(title, fontsize=50)
    plt.show()
#Function to create the wordcloud for each point
def createWordCloudForPoint(df, pointsDescription):
    #filter by point
    dfPoint = df.loc[df['pointsDescription'] == pointsDescription]
    #clean description
    dfPoint['description'] = dfPoint['description'].apply(lambda x: cleanDescription(x))
    #join all the descriptions, separated by a space so that words of adjacent descriptions do not merge
    description = ' '.join(' '.join(l) for l in dfPoint['description'].values)
    ########################
    #Remove the comment below to save in a dataframe the most used word for each point
    #find most 10 used word in the description save it in a dataframe
    #dfMostUsedWord = pd.DataFrame(description.split(), columns=['word']).word.value_counts().reset_index().rename(columns={'index':'word', 'word':'count'}).head(10)
    #print(point)
    #display(dfMostUsedWord)
    ########################
    #create wordcloud
    createWordCloud(description, 'Most used words of \'' + str(pointsDescription) + '\' category')
#remove warning
pd.set_option('mode.chained_assignment', None)
#Create wordcloud for each point
for point in set(df['pointsDescription']):
    print(point)
    createWordCloudForPoint(df, point)
#reset warning
pd.reset_option('mode.chained_assignment')
Acceptable
Good
Classic
Very good
Superb
Excellent
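The word clouds give a visual impression; the most frequent words of a category can also be listed explicitly. A minimal sketch using `collections.Counter`, assuming the descriptions have already been cleaned into lists of words (the `cleaned` data is made up and `most_common_words` is a hypothetical helper):

```python
from collections import Counter

def most_common_words(word_lists, n=10):
    """Count words across already-cleaned descriptions and return the n most frequent."""
    counter = Counter()
    for words in word_lists:
        counter.update(words)
    return counter.most_common(n)

# Made-up cleaned descriptions (one list of words per review)
cleaned = [
    ["cherry", "tannins", "oak"],
    ["cherry", "acidity"],
    ["cherry", "oak", "spice"],
]
top_words = most_common_words(cleaned, n=2)
print(top_words)
```

Counting on the word lists directly avoids joining all descriptions into one long string first.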